
HTML Tags Processed by Packager


WebCD Packager relies on the user to enter the starting URL. After the starting URL, WebCD Packager discovers further URLs by parsing the following HTML tags out of each downloaded HTML file and by examining HTTP header information.

Base tag information is used to resolve relative URLs in the page:

<BASE HREF = "url">

This tag is removed unless it contains attributes other than HREF, in which case only the HREF attribute is removed from the tag.
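
As a rough illustration of the two behaviors described above (resolving relative URLs against the base, and removing the HREF from the BASE tag), here is a minimal Python sketch. It is not WebCD Packager's own code. The function names resolve and rewrite_base_tag, the example.com URLs, and the regular expressions are all assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

def resolve(page_url, base_href, link):
    """Resolve a link against BASE HREF when present, else against the page URL."""
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)

# Hypothetical patterns: a BASE tag, and an HREF attribute inside it.
BASE_TAG = re.compile(r"<BASE\b([^>]*)>", re.IGNORECASE)
HREF_ATTR = re.compile(r"\s*\bHREF\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s>]+)", re.IGNORECASE)

def rewrite_base_tag(html):
    """Drop HREF from each BASE tag; drop the whole tag if nothing else remains."""
    def repl(match):
        rest = HREF_ATTR.sub("", match.group(1)).strip()
        return f"<BASE {rest}>" if rest else ""
    return BASE_TAG.sub(repl, html)

page = "http://example.com/docs/index.html"
print(resolve(page, None, "img/logo.gif"))
print(resolve(page, "http://example.com/assets/", "img/logo.gif"))
print(rewrite_base_tag('<BASE HREF="http://example.com/assets/" TARGET="_top">'))
```

The sketch uses a regular expression only to keep the example short; a real rewriter would more likely work on parsed tags.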

Form Actions
Form actions are always added as links and default to Do Not Retrieve:

<FORM ACTION = "url">

Note: The form itself is retrieved.

URLs Beneath the Path of the Starting URL
The following URLs are marked for retrieval, and retrieved, if they fall within the path prefix of a URL that is marked for retrieval (i.e., the parent is marked for retrieval); otherwise, they default to Do Not Retrieve:

<A HREF = "url">
<AREA HREF = "url">
<EMBED SRC = "url1" PLUGINSPAGE = "url2">
Note: url1 defaults to Retrieve. url2 defaults to Retrieve if its parent is marked Retrieve.
<FRAME SRC = "url">
<IFRAME SRC= "url">
<LINK HREF = "url">
<META HTTP-EQUIV = "Refresh" CONTENT = "n; URL = url">
<OBJECT CODE = "url1" CODEBASE = "url2">
Note: url1 and url2 are combined to form one URL. Neither is required; if only one is present, that URL is used.
<SCRIPT SRC= "url">
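
The path-prefix rule for the tags listed above can be sketched as follows. This is only an approximation under stated assumptions: the "path prefix" is taken to be the directory portion of the parent URL, links are assumed to be already resolved to absolute URLs, and the names path_prefix, default_action, and combine_object_url are hypothetical. The OBJECT helper further assumes that CODE is resolved relative to CODEBASE when both are present, which the text above does not spell out.

```python
from urllib.parse import urljoin, urlsplit

def path_prefix(url):
    """Return (scheme, host, directory) for a URL; the directory ends with '/'."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.scheme.lower(), parts.netloc.lower(), directory

def default_action(link, parent_url):
    """Default for an absolute link found on a parent page marked Retrieve."""
    scheme, host, directory = path_prefix(parent_url)
    target = urlsplit(link)
    inside = (
        target.scheme.lower() == scheme
        and target.netloc.lower() == host
        and (target.path or "/").startswith(directory)
    )
    return "Retrieve" if inside else "Do Not Retrieve"

def combine_object_url(code=None, codebase=None):
    """Combine OBJECT's CODE and CODEBASE (assumption: CODE is relative to CODEBASE)."""
    if code and codebase:
        return urljoin(codebase, code)
    return code or codebase  # only one present: use it; neither: None

parent = "http://example.com/docs/index.html"
print(default_action("http://example.com/docs/api/a.html", parent))
print(default_action("http://example.com/other/a.html", parent))
print(combine_object_url("Clock.class", "http://example.com/classes/"))
```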

Exception: Server side image maps default to Do Not Retrieve. The img-url is retrieved (see the tag below).

<A HREF="map-url"><IMG SRC="img-url" ISMAP></A>
WebCD Packager provides a server side image map conversion option. The absence of a USEMAP attribute (as in the tag directly above) indicates a server side image map. When WebCD Packager detects this situation, a URL for the image map file is automatically added to the retrieval specification (Retrieval View). The URL added is a file URL pointing to the project's Maps folder, using the map-url's base name. See Converting Server Side Image Maps to Client Side for more details on server side image map conversion.
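
One way to picture the detection step is the Python sketch below, which looks for an IMG with ISMAP and no USEMAP inside an A element. It is an illustration only, built on the standard html.parser module. The class name ServerSideMapFinder, the helper map_base_name, and the sample URLs are assumptions, and the exact form of the file URL into the Maps folder is not shown because the text above does not specify it.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlsplit

class ServerSideMapFinder(HTMLParser):
    """Collect (map_url, img_url) pairs for server side image maps."""

    def __init__(self):
        super().__init__()
        self._href = None  # HREF of the enclosing <A>, if any
        self.maps = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # names arrive lowercased; bare ISMAP maps to None
        if tag == "a":
            self._href = attributes.get("href")
        elif tag == "img" and self._href and "ismap" in attributes and "usemap" not in attributes:
            self.maps.append((self._href, attributes.get("src")))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

def map_base_name(map_url):
    """Base name of the map URL's path, e.g. 'nav.map'."""
    return posixpath.basename(urlsplit(map_url).path)

finder = ServerSideMapFinder()
finder.feed('<A HREF="/cgi-bin/nav.map"><IMG SRC="nav.gif" ISMAP></A>')
print(finder.maps, map_base_name(finder.maps[0][0]))
```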

URLs that are Always Retrieved
The following URLs are always retrieved, whether or not they are within the path of the starting URL:

<BGSOUND SRC = "url">
<BODY BACKGROUND = "url">
<IMG SRC = "url" DYNSRC = "url" LOWSRC = "url">
<INPUT SRC = "url">
<TABLE BACKGROUND = "url">
<TD BACKGROUND = "url">
<TH BACKGROUND = "url">
<TR BACKGROUND = "url">
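
For readers who want to see how such a tag table might be applied, the following Python sketch collects the always-retrieved URLs from a page using the standard html.parser module. It is a hedged illustration, not the Packager's implementation. The names ALWAYS_RETRIEVED and AlwaysRetrievedCollector are hypothetical, and the returned URLs are left unresolved.

```python
from html.parser import HTMLParser

# Tag name -> attributes whose URLs are always retrieved (from the list above).
ALWAYS_RETRIEVED = {
    "bgsound": ("src",),
    "body": ("background",),
    "img": ("src", "dynsrc", "lowsrc"),
    "input": ("src",),
    "table": ("background",),
    "td": ("background",),
    "th": ("background",),
    "tr": ("background",),
}

class AlwaysRetrievedCollector(HTMLParser):
    """Gather URLs from the attributes listed in ALWAYS_RETRIEVED."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = ALWAYS_RETRIEVED.get(tag, ())
        for name, value in attrs:  # tag and attribute names arrive lowercased
            if name in wanted and value:
                self.urls.append(value)

collector = AlwaysRetrievedCollector()
collector.feed('<BODY BACKGROUND="bg.gif"><IMG SRC="a.gif" LOWSRC="a-low.gif">'
               '<A HREF="x.html">x</A></BODY>')
print(collector.urls)
```

The A HREF in the sample is deliberately ignored, since anchors fall under the path-prefix rule rather than this list.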

Title Text
Title text is stored as the page's title. Leading white space, trailing white space, and non-printable characters are ignored:

<TITLE> and </TITLE>
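
The title clean-up can be pictured with a short Python sketch. It assumes that "ignored" means the characters are dropped rather than replaced, and it uses Python's own notion of a printable character, which may differ from the Packager's. The function name clean_title is hypothetical.

```python
def clean_title(raw):
    """Drop non-printable characters, then trim leading and trailing white space."""
    printable = "".join(ch for ch in raw if ch.isprintable())
    return printable.strip()

print(repr(clean_title("\n  My\x07 Page \t\n")))
```

Under this assumption a line break inside a title is dropped, so a title split across lines would have its words joined; a real implementation might substitute a space instead.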

Header Information
WebCD Packager uses HTTP header information to determine whether a URL is redirected. URLs that are redirected outside the path of the starting URL default to Do Not Retrieve.
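
A redirect check along these lines might look like the Python sketch below. It is only an illustration under several assumptions: the redirect is signalled by a 3xx status with a Location header, the headers are supplied as a plain dictionary keyed by "Location", and "the path of the starting URL" means the directory portion of that URL. The names REDIRECT_CODES and redirected_outside are hypothetical.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}

def redirected_outside(start_url, url, status, headers):
    """True if the response redirects url to a target outside start_url's path."""
    location = headers.get("Location")
    if status not in REDIRECT_CODES or not location:
        return False  # not a redirect
    target = urlsplit(urljoin(url, location))  # Location may be relative
    start = urlsplit(start_url)
    prefix = start.path.rsplit("/", 1)[0] + "/"
    inside = (
        (target.scheme, target.netloc) == (start.scheme, start.netloc)
        and (target.path or "/").startswith(prefix)
    )
    return not inside

start = "http://example.com/docs/index.html"
old = "http://example.com/docs/old.html"
print(redirected_outside(start, old, 302, {"Location": "http://other.example/new.html"}))
print(redirected_outside(start, old, 302, {"Location": "new.html"}))
```

A URL for which this returns True would default to Do Not Retrieve under the rule above.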

Note: More tags will be processed in future releases of WebCD Packager.

Related Topics

Converting Server Side Image Maps to Client Side
HTTP Redirection
Omitting/Inserting HTML in the Packager Process


Copyright 1996 MarketScape Inc., Colorado Springs, CO USA. All Rights Reserved.